Adaptive Two-Level Blocking Coordinated Checkpointing for High Performance Cluster Computing Systems
نویسندگان
چکیده
Blocking coordinated checkpointing is a well-known method for achieving fault tolerance in cluster computing systems. In this work, we introduce a new approach for blocking coordinated checkpointing using two-level checkpointing. The first level of checkpointing is local checkpointing, and computing nodes save the checkpoints in local disk. If a transient failure occurs in the computing node, the process can recover from local disk. Second level of checkpointing is global checkpointing and computing nodes send their checkpoints to highly reliable global stable storage. If a permanent failure occurs in the computing node, it can not be used and the process can recover from global storage in a new computing node. Local checkpoints are taken more frequently than global checkpoints. Also, in the end of each local checkpointing interval, the system determines the expected recovery time in the case of permanent failure and adaptively takes a global checkpoint, or skips. Experimental results show that average execution time of NAS-BT application is significantly reduced by using the proposed method. Maximum reduction of execution time of this application is 38%.
منابع مشابه
An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment
Mobile computing systems are made up of different components among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environment. In the scheme suggested nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...
متن کاملBlocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols
A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPIs has led to the development of several fault tolerant MPI environments. Different approaches a...
متن کاملOn the Impossibility of Min-Process Non-Blocking Checkpointing and An Efficient Checkpointing Algorithm for Mobile Computing Systems
Mobile computing raises many new issues, such as lack of stable storage, low bandwidth of wireless channel, high mobility, and limited battery life. These new issues make traditional checkpointing algorithms unsuitable. Prakash and Singhal [14] proposed the first coordinated checkpointing algorithm for mobile computing systems. However, we showed that their algorithm may result in an inconsiste...
متن کاملA New High Performance Checkpointing Approach for Mobile Computing Systems
In this paper, we present a single phase non-blocking coordinated checkpointing algorithm suitable for mobile computing environments. The distinct advantages that make the proposed algorithm suitable for distributed mobile computing systems are the following. It produces a consistent set of checkpoints, without the overhead of taking temporary checkpoints; the algorithm makes sure that only min...
متن کاملAnti-message Logging Based Coordinated Checkpointing Protocol for Deterministic Mobile Computing Systems
A checkpoint algorithm for mobile computing systems needs to handle many new issues like: mobility, low bandwidth of wireless channels, lack of stable storage on mobile nodes, disconnections, limited battery power and high failure rate of mobile nodes. These issues make traditional checkpointing techniques unsuitable for such environments. Minimum-process coordinated checkpointing is an attract...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- J. Inf. Sci. Eng.
دوره 26 شماره
صفحات -
تاریخ انتشار 2010